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Abstract 


To address the problem of real-time processing of ultra-wide bandwidth pulsar baseband data, we designed and 
implemented a pulsar baseband data processing algorithm (PSRDP) based on GPU parallel computing 
technology. PSRDP can perform operations such as baseband data unpacking, channel separation, coherent 
dedispersion, Stokes detection, phase and folding period prediction, and folding integration in GPU clusters. 
We tested the algorithm using the J0437-4715 pulsar baseband data generated by the CASPSR and Medusa 
backends of the Parkes, and the J0332--5434 pulsar baseband data generated by the self-developed backend of 
the NanShan Radio Telescope. We obtained the pulse profiles of each baseband data. Through experimental 
analysis, we have found that the pulse profiles generated by the PSRDP algorithm in this paper are essentially 
consistent with the processing results of Digital Signal Processing Software for Pulsar Astronomy (DSPSR), 
which verified the effectiveness of the PSRDP algorithm. Furthermore, using the same baseband data, we 
compared the processing speed of PSRDP with DSPSR, and the results showed that PSRDP was not slower 
than DSPSR in terms of speed. The theoretical and technical experience gained from the PSRDP algorithm 
research in this article lays a technical foundation for the real-time processing of QTT (Qi Tai radio Telescope) 
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1. Introduction 


With the rapid development of manufacturing technology 
and information technology, radio telescope receiving system 
has developed toward ultra-wide bandwidth and PAF (phased 
array feed). Multi-beam and ultra-wide bandwidth receiving 
systems make the amount of astronomical data obtained by 
observation increase sharply (Zhang & Duan 2022). For 
example, low-frequency array needs to process about 14 Tb 
raw astronomical signals per second at a sampling rate of 
200 MHz (van Haarlem et al. 2013). The multi-beam backend 
of FAST (Five-hundred-meter Aperture Spherical Telescope; 
Nan et al. 2011) has a data sampling rate of 12.8 GB s ! under 
the condition of 100 us time resolution, 4000 channels and 
dual polarization (Li et al. 2018). The QTT (Qi Tai radio 
Telescope; Wang et al. 2023), which is being built in Xinjiang, 
China, its ultra-wide bandwidth and PAF (van Cappellen et al. 
2022) receiving system will generate an annual archive data 
volume that is expected to exceed 10 PB. The real-time 
processing of ultra-wide bandwidth pulsar data is one of the 
urgent problems to be solved and also the starting point of this 
study. 


methods: data analysis — techniques: miscellaneous 


With the continuous advancement of astronomical research 
and the rapid development of digital technology, the perfor- 
mance requirements of digital backend systems and related 
equipment for astronomical observation are constantly increas- 
ing. Backend systems need to achieve high-speed sampling, 
real-time analysis, and data preprocessing at wider bandwidths, 
higher time resolution, and higher frequency resolution. 
Handling the high-speed data streams generated by PAF and 
multi-beam receiving systems, as well as the real-time 
processing of massive astronomical data on heterogeneous 
platforms, are currently urgent challenges in the operation of 
radio observation facilities. Due to limitations in storage device 
performance, massive astronomical signals can only be 
processed and analyzed in real-time, posing significant 
challenges to the performance of data processing equipment 
(Wei 2019). 

Current mainstream digital backend systems often employ a 
hybrid architecture consisting of FPGA, CPU, and GPU. 
Heterogeneous systems require data to be exchanged between 
different devices. One of the technical challenges in data 
processing systems based on heterogeneous platforms is how to 
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achieve high-speed data flow between the CPU and GPU while 
avoiding issues such as high error rates and packet loss during 
the transmission of massive data. For the high-speed transmis- 
sion and distribution of data, one feasible approach is to create 
a ringbuffer in memory to perform real-time high-speed 
caching of data and transfer the buffer data to the GPU 
memory, ultimately enabling real-time data processing on 
the GPU. 

The use of GPU can significantly enhance the running speed 
of applications in scientific analysis, simulations, and other 
fields. Internationally, real-time processing of pulsar signals 
using CPU--GPU heterogeneous computing platforms has 
become mainstream. Utilizing the computational units of GPUs 
to process observational data can reduce the CPU workload, 
greatly enhancing the data flow processing efficiency of the 
system. As radio astronomy digital backend technology 
continues to advance, real-time signal processing presents a 
serious challenge to traditional computing techniques. The 
powerful computational capabilities of GPU clusters provide a 
feasible solution for handling the massive data streams 
generated during radio astronomy observations. 

The Australian Parkes ultra-wide bandwidth receiver system 
uses an FPGA--GPU architecture for preprocessing and 
recording observational data. It divides the 704—4032 MHz 
signal into three analog RF bands for sampling. In the FPGA 
preprocessing stage, it further divides the data into 26 
continuous subbands, each with a bandwidth of 128 MHz, 
using PFB (polyphase filter bank) technology (Zhang et al. 
2023b). The FPGA packages the data into VDIF format’ and 
transfers it to GPU nodes using the UDP protocol. To handle 
high-speed network flows, each GPU node uses PPRDADA’ to 
establish high-speed ringbuffers to temporarily store the data. 
Subsequently, the data is copied from the buffer to the GPU 
memory for further processing (Hobbs et al. 2020). The GBT 
(Green Bank Telescope) in the United States uses the 
ROACH2°+GPU architecture for processing pulsar signals in 
its VEGAS (Versatile GBT Astronomical Spectrometer) back- 
end. VEGAS’s pipeline software consists of multiple con- 
current threads. The network threads transmit data from 
ROACH2 to a shared buffer in the computing node via a 
10 GbE network link. Then, the reading threads copy the data 
to the GPU memory. After data processing, the results are sent 
to the disk-writing threads for storage on the disk (Chenna- 
mangalam et al. 2014). 

The FAST backend employs an architecture combining 
ROACH2 and GPUs along with high-speed networking to 
fulfill observational requirements such as pulsar searches and 
timing (Nan & Li 2014; You et al. 2021). The Digital Backend 
System of the Shanghai Tian Ma Radio Telescope uses a 
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hybrid FPGA+GPU architecture. Data output by the FPGA is 
transmitted through a 10 GbE network link to eight high- 
performance computers equipped with GPUs for coherent 
dedispersion and incoherent dedispersion (Yan et al. 2017). 
The Yunnan Astronomical Observatory utilizes a pulsar 
observation system with a bandwidth of 512 MHz, 8 bit 
sampling precision, and dual-polarization input, using 
ROACH2 as the baseband data acquisition backend and 
Digital Signal Processing Software for Pulsar Astronomy 
(DSPSR)'° as the data processing core (Xu et al. 2015). This 
system offers 128 channels (each with a 4 MHz bandwidth) for 
processing, including data decoding, coherent dedispersion, 
polarization computation, and folding. The results of data 
processing are stored in the PSRFITS format (Hotan et al. 
2004). The pulsar observation system for the HRT (Haoping 
Radio Telescope) employs the ROACH2+GPU architecture. 
Data output by ROACH2 is transmitted via a 10 GbE link to 
eight GPU nodes, supporting both incoherent dedispersion and 
coherent dedispersion observational modes (Luo et al. 2020). 


2. Parallel Processing Algorithm for Pulsar Baseband 
Data Based on GPU 


To meet the real-time or near-real-time processing require- 
ments of ultra-wide bandwidth pulsar baseband data, we have 
designed and implemented the PSRDP (PulSaR baseband Data 
Processing) algorithm using GPU parallel computing technol- 
ogy. With the powerful computational capabilities of GPUs, the 
algorithm’s efficiency has been greatly improved. GPUs have 
more cores than CPUs and can simultaneously execute more 
threads, making them suitable for handling massive computa- 
tional tasks involving ultra-wide bandwidth pulsar base- 
band data. 

The main process of the PSRDP algorithm is illustrated in 
Figure 1. It starts with reading the pulsar baseband data on the 
HOST, copying the data to the DEVICE, and then performing 
channelization, coherent dedispersion, stokes detection, folding 
integration, and other operations. Once data processing is 
completed, the results are copied back to the HOST for storage. 


2.1. Unpack Baseband Data 


After copying the baseband data to the GPU memory, the 
baseband data needs to be unpacked first. For the baseband data 
recorded by the CASPSR backend, dual polarization 8 bit 
sampling, two polarizations every four data (real numbers) are 
stored alternately. As shown in Figure 2, the dual polarization 
data are unpacked, the data are recombined, and data with the 
same polarization are put together to facilitate subsequent 
processing. In order to realize GPU parallel unpacking, 
Equation (1) is used to establish a relationship between the 
data index after unpacking and the data index before 
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Figure 1. PSRDP data processing flow. 
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where i is the data index before unpacking, j is the data index 
after unpacking, npad — 4 (because dual polarizations are stored 


alternately every four data), if the data block length is N, then 
polsize — N/2. For the dual polarization baseband data recorded 
by the Medusa backend, dual polarization 32 bit sampling, the 
dual polarizations are stored alternately every 2048 data (1024 
complex numbers are also equal to 2048 real numbers), as 
shown in Figure 3. It is similar to the CASPSR unpacking 
process, npad — 2048, and the data needs to be Perform offset 
binary decoding. Each datum is independent, GPU parallel 
technology can be fully utilized to accelerate unpacking. 
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Figure 2. CASPSR backend baseband data unpacking (The data block length is N, M = N/2). 


pol, pol, pol, 


pol, pol, pol, 
a 


4086-6143 


6144-8101 (N-4096)~(N-2049) (N-2048)~(N-1) 


2048~4095 (M-2048)~(M-1) 


M~ (M+2047) 


(N+2048)~ (+4095) | - 


(N-2048)-(N-1) 


! im um 


| | | 


offset binary decode (out = in^0x8000) 


| | | 


| | | 


0-2047 2048-4085 | vs (M-2048)~(M-1) 


M~(M+2047) (M+2048)~(M+4095) (N-2048)~(N-1) 


UU RO RR 


pol, 


pol, 


Figure 3. Medusa backend baseband data unpacking (The data block length is N, M — N/2). 


2.2. Multi-channel Coherent Dedispersion 


During the transmission of pulsar signals to Earth, they 
undergo dispersion due to the influence of the ISM (interstellar 
medium), causing high-frequency signals to arrive at Earth 
before low-frequency signals. The effect of the ISM on radio 
waves is akin to the action of a filter with an ISM transfer 
function. To restore the original characteristics of pulsar 
signals, it is necessary to perform dedispersion, where coherent 
dedispersion involves filtering the observed signal through the 
inverse of the transfer function of the ISM. In theory, coherent 
dedispersion can completely eliminate the dispersion effects 
caused by the ISM in the signal (Zhang et al. 2023a). 

As shown in Figure 4, after performing FFT on the time data 
sequence, it is further channelized into multiple subbands. 
Based on the central frequency, bandwidth, and dispersion 
measure (DM) of each subband, chirp coefficients (the 
inverse function of the ISM transfer function) are generated 


(Lorimer & Kramer 2012), as shown in Equation (2). 


27D 
(fer + MSZ 


H(fa + Af)! = exp] —i DM (Afy |, 


(2) 


In the equation, the reference frequency fier is set as the central 
frequency, Af represents the difference between a specific 
frequency point f and fier, and D is the dispersion constant, which 
is equal to 4.15 x 10° MHZ pe cm? - s. Subsequently, the 
chirp coefficients generated for each subband are multiplied point- 
to-point with the data from each subband to achieve coherent 
dedispersion. Since there is no interdependence between the data, 
this process can be efficiently parallelized on a GPU. 


2.3. Stokes Detection 


After coherent dedispersion, for dual-polarization data obtained 
from two orthogonal polarization channels, poly and pol). 
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Figure 4. Multi-channel coherent dedispersion process. 


Four Stokes parameters are J, Q, U and V (Sutinjo et al. 2022). For 
dual-linear feeds, the Stokes parameters can be calculated using 
Equation (3). For dual-circular feeds, Equation (4) can be used to 
calculate. 


Paa + Ppp 
Paa — Pop 
2 Re(Pyp) 
2 Im(P,p) 


<a ON 


Paa + Pop 
2 Im(P4g) 


T 12Re(X * Y) | e 


Pap — Paa 


<Q Ons 


where P44 and Pgg are the powers of two orthogonal 
polarization channels polo and pol, respectively, Pag is the 
cross-correlation product of dual polarizations, Re(P45) and 
Im(P4g) represent the real part and imaginary part of Pag 
respectively. 


Stokes parameters are completely independent. As shown in 
Figure 5, each datum uses a GPU thread to implement 
calculations. GPU parallel technology can greatly reduce the 
resource loss caused by traditional CPU single thread and 
improve computing efficiency. 


2.4. Folding and Integrating 


Different pulsars have inconsistent periods, and coupled with 
different sampling rates, the number of sampling points in each 
period of the pulsar often cannot be represented by an integral 
power of 2 (in order to optimize the FFT execution, the number 
of input points is best represented by an integral power of 2) , 
which will cause the last remaining data to be discarded or 
processed specifically during the integral calculation process. 
The strategy adopted by this algorithm is to combine folding 
and integration operations and process them at the same time. 

In the CUDA programming model, calculations are sorted 
according to a three-level hierarchy. Each call to the CUDA 
kernel creates a new Grid, which is composed of multiple 
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Figure 5. Parallel detection of Stokes parameters. 
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Figure 6. CUDA three-level thread model. 


Blocks. Each Block is composed of up to 1024 separate 
Threads (Hijma et al. 2023). As shown in Figure 6, Grid can 
control the number of Blocks by setting three-dimensions: 
gridDim.x, gridDim.y and gridDim.z. Similarly, Block can also 
control the number of Threads through blockDim.x, blockDim.y 
and blockDim.z. 

By default, the number of profile data after final folding 
integration is 1024. Therefore, for n channels and four Stokes 
parameters, a space of size 4: 1024 «n needs to be pre- 
allocated to store the final generated profile data. For multi- 
channel multi-polarization data, since the starting recording 
time is the same, the corresponding profile sequence index for 
each group is also the same. Only one set of index sequence 
needs to be generated, and its index sequence generation 
method is shown in Figure 7. 


The period of pulsars is very stable, but the locations of 
observatories, pulse dispersion, orbit parameters of the solar 
system (Romer delay, Shapiro delay and Einstein delay) and 
binary systems can affect the times-of-arrival of pulses, which 
in turn affects the folding period of pulsar data (Pennucci 2015). 
In order to accurately fold the pulse profile and the phase of a 
certain timestamp, it is necessary to predict the folding period 
and the phase of the pulsar at that timestamp (Zhang et al. 
20232). 

When processing the pulsar baseband data, it is necessary to 
divide the baseband data into blocks for further processing. In 
order to accurately fold the pulse profile, it is necessary to 
predict the folding period of the pulse and the relative phase 
value (between 0 and 1) of the first sample point in each block 
according to the starting time of each block. After determining 
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Figure 7. Index sequence generation procedure. 


the phase value of the first sample point, the phase values of all 
sample points in this block after folding can be calculated. The 
phase value of the ith sample point is shown in Equation (5). 


phase, = (phase, + i x phase per. sample) 
— floor (phase, + i x phase per. sample). 
(5) 


Among them, phase, is the phase value of the first sampling 
point, which can be obtained through prediction. The floor() 
function represents rounding down, and phase per sample 
represents the phase size interval of each sampling point, which 
can be expressed as Equation (6). 


ibin — phase; x nbin. (6) 


The same indexes in the index sequence may not be 
clustered together (a piece of data may be more than one 
period). For the folding operation, the corresponding points of 
each period are added and averaged, while the integral is the 
average of multiple consecutive points. The pulse profile index 
generation diagram is shown in Figure 7. We combine the 
folding and integration operations, that is, add the data 
corresponding to the same index and average it. 

During the implementation of the algorithm, direct addition 
of GPU multi-threaded calculation results will lead to error 
results. In this case, atomic addition operations are often used. 
Because atomic addition uses a locking mechanism at the 
bottom layer for data protection, it seriously slows down the 
calculation speed. In order to deal with the above problems, we 
designed and implemented a multi-threaded folding integration 
algorithm based on GPU parallel technology based on the 
folding integration method proposed in Section 7.1 of the 
handbook of pulsar astronomy (Lorimer & Kramer 2012). The 
core idea is that for multi-channel multi-polarized data, 
assuming that for N-channel M-polarized data, the final 
contour data size generated by each polarization is nbin. Use 
GPU multi-threading technology to open N * M x nbin threads, 


each nbin threads process one polarization channel, where each 
polarization performs a serial addition of data with the same 
index within a channel. 

The core idea is that for multi-channel multi-polarized data, 
assuming that for N-channel M-polarized data, the final contour 
data size generated by each polarization is nbin. Use GPU 
multi-threading technology to open N * M x nbin threads. 

In order to implement the above algorithm, set the 
dimensions of Grid: gridDim.x — 1, gridDim.y — nchan(num- 
ber of channels) and gridDim.z — npol (number of polariza- 
tions), and similarly set the dimensions of Block: 
blockDim.x — 1024 (the number of pulse profiles included), 
blockDim.y = 1 and blockDim.z = 1, there are a total of 
npol x nchan x 1024 threads. As shown in Figure 8, the data 
with index 1 in a certain polarization of a certain channel uses a 
thread to perform serial loop addition, avoiding the resource 
overhead problem caused by atomic addition. 


3. Experimental Analysis and Results 
3.1. Parkes Baseband Data Processing 
3.1.1. CASPSR Baseband Data Processing 


The PSRDP was tested using the baseband data generated by 
the CASPSR backend observation of the Parkes. The baseband 
data observation source is J0437-4715, the bandwidth size is 
400 MHz, the bandwidth range is 1182-1582 MHz, and the 
observation time is 8s and the data size is 12.8 GB. Use PSRDP 
to divide it into 1024 channels, and perform Stokes detection, 
coherent dedispersion, integration and folding on each channel. 
Extract the J of the stokes parameters of each channel to obtain 
Figure 9. It can be clearly seen that there is dispersion delay 
between each channel. After performing incoherent dedisper- 
sion, as shown in Figure 10, the profiles of each subband can be 
correctly aligned. As shown in Figure 11, the phases and 
amplitudes of the two profiles are basically consistent, which 
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Figure 8. Multi-channel multi-polarization folding and integrating. 
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Figure 9. Dynamic spectrum of J0437-4715 (before dedispersion). 


verifies the effectiveness of the PSRDP period and phase 
prediction methods. 

We use DSPSR to process the baseband data and also output 
1024 channels. The final profile generated is compared with the 
profile generated by PSRDP. As shown in Figure 10, the 
phases and amplitudes of the two profiles are basically 
consistent, which verifies the effectiveness of the PSRDP 
period and phase prediction methods. 
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Figure 10. Dynamic spectrum of J0437-4715. 


3.1.2. Medusa Baseband Data Processing 


The PSRDP was tested using the baseband data generated by 
the Medusa observation of the Parkes. The baseband data 
observation source is J0437-4715, the bandwidth size is 
128 MHz, the bandwidth range is 704-832 MHz, the observa- 
tion time is about 9.6 s, and the data size is 9.6 GB. The 
baseband data is divided into 128 MHz subbands. The 
processing steps are similar to those in Section 3.1.1. After 
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Figure 11. Comparison of profiles processed by PSRDP and DSPSR (CASPSR 
baseband data). 
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Figure 12. Dynamic spectrum of J0437-4715 (with RFI channel). 


performing incoherent dedispersion, Figure 12 is obtained. 
It can be clearly seen that there is RFI (Radio Frequency 
Interference) in some channels. After removing the interfered 
channels, as shown in Figure 13, a clear pulse profile can 
be seen. 

We use DSPSR to process the baseband data and output 
128 channels. After removing the interference at the same 
position, the final profile generated is compared with the 
contour generated by PSRDP. As shown in Figure 14, the 
phase and amplitude of the two profiles are basically the same. 


3.2. NSRT Baseband Data Processing 


We are based on the self-developed pulsar backend (the 
backend uses three self-developed FPGA development boards 
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Figure 13. Dynamic spectrum of J0437-4715 (without RFI channel). 
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Figure 14. Comparison of profiles processed by PSRDP and DSPSR (Medusa 
baseband data). 


and uses PFB technology to divide the 964-1732 MHz 
bandwidth signal into six 128 MHz bandwidth subbands, and 
each subband is transmitted to the GPU server for processing.), 
using the NSRT (NanShan Radio Telescope) L-band 1 GHz 
bandwidth receiver observation to generate baseband data. The 
observation source is J03324-5434, and the output single- 
channel 128 MHz bandwidth baseband data is dual-polarized 
with 16 bit accuracy, the spectrum range is 1220-1348 MHz, 
and the observation duration is 5 minutes. Using the same data 
processing method as in Section 3.1.1/3.1.2, Figure 15 is 
obtained. After deleting some of the interfered channels, the 
dynamic spectrum obtained is shown in Figure 16. 
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Figure 15. Dynamic spectrum of J0332--5434 (with RFI channel). 
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Figure 16. Dynamic spectrum of J0332--5434 (without RFI channel). 


3.3. PSRDP and DSPSR Speed Comparison 


Regarding the execution time of the algorithm, we used the 
same data to conduct speed tests on PSRDP and DSPSR. The 
number of channels was set to 2, 4, 8, 16, 32, 64, 128, 256, 512 
and 1024 respectively. The test file was CASPSR baseband 
data, after multiple rounds of testing, calculate the average 
processing time. The test results are shown in Figure 17. The 
execution time of PSRDP is not higher than DSPSR. 


4. Conclusion 


In response to the QTT ultra-wide bandwidth pulsar data 
processing requirements, we designed and implemented 
PSRDP based on GPU parallel technology. After the PSRDP 
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Figure 17. PSRDP and DSPSR speed comparison. 
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reads the baseband data at the HOST, it transmits to the 
DEVICE. The wideband signal is further subdivided using 
FFT, and each subband after subdivision is subjected to 
coherent dedispersion, stokes detection, periodic phase predic- 
tion and folding integration operations. Finally, it is transmitted 
back to the HOST for scientific data storage. 

We systematically tested the PSRDP using the J0437-4715 
pulsar baseband data generated by the CASPSR and Medusa 
backends and the J0332--5434 pulsar baseband data generated 
by the self-developed backend of the NSRT. The PSRDP is 
used to output multiple subbands, and profiles between 
multiple subbands are aligned through incoherent dedispersion, 
and the final profile is generated. Compared with DSPSR in 
phase and amplitude, the pulse profile generated by PSRDP has 
no obvious difference and is slightly better than DSPSR in 
terms of algorithm processing speed. By comparing the PSRDP 
experimental results with the DSPSR processing results, the 
effectiveness of the PSRDP algorithm is verified. 
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